Abstract


We analyze in this paper a random feature map based on a theory of invariance
(I-theory) introduced in [1]. More specifically, a group invariant signal signature
is obtained through cumulative distributions of group-transformed random projections.
Our analysis bridges invariant feature learning with kernel methods, as
we show that this feature map defines an expected Haar-integration kernel that is
invariant to the specified group action. We show how this non-linear random feature
map approximates this group invariant kernel uniformly on a set of N points.
Moreover, we show that it defines a function space that is dense in the equivalent
Invariant Reproducing Kernel Hilbert Space. Finally, we quantify error rates of
the convergence of the empirical risk minimization, as well as the reduction in the
sample complexity of a learning algorithm using such an invariant representation
for signal classification, in a classical supervised learning setting.

1 Introduction 

Encoding signals or building similarity kernels that are invariant to the action of a group is a key 
problem in unsupervised learning, as it reduces the complexity of the learning task and mimics how 
our brain represents information invariantly to symmetries and various nuisance factors (change in 
lighting in image classification and pitch variation in speech recognition) [1, 2, 3, 4]. Convolutional
neural networks [5, 6] achieve state-of-the-art performance in many computer vision and speech
recognition tasks, but require a large amount of labeled examples as well as augmented data, where
we reflect symmetries of the world through virtual examples [7, 8] obtained by applying
identity-preserving transformations such as shearing, rotation, translation, etc., to the training data. In this
work, we adopt the approach of [1], where the representation of the signal is designed to reflect 
the invariant properties and model the world symmetries with group actions. The ultimate aim is 
to bridge unsupervised learning of invariant representations with invariant kernel methods, where 
we can use tools from classical supervised learning to easily address the statistical consistency and 
sample complexity questions [9, 10]. Indeed, many invariant kernel methods and related invariant 
kernel networks have been proposed. We refer the reader to the related work section for a review 
(Section 5) and we start by showing how to accomplish this invariance through group-invariant Haar- 
integration kernels [11], and then show how random features derived from a memory-based theory 
of invariances introduced in [1] approximate such a kernel. 

1.1 Group Invariant Kernels 

We start by reviewing group-invariant Haar-integration kernels introduced in [11], and their use in a 
binary classification problem. This section highlights the conceptual advantages of such kernels as 
well as their practical inconvenience, putting into perspective the advantage of approximating them 
with explicit and invariant random feature maps. 
Invariant Haar-Integration Kernels. We consider a subset $\mathcal{X}$ of the hypersphere in d dimensions, $\mathbb{S}^{d-1}$.
Let $\rho_{\mathcal{X}}$ be a measure on $\mathcal{X}$. Consider a kernel $k_0$ on $\mathcal{X}$, such as a radial basis function kernel.
Let $G$ be a group acting on $\mathcal{X}$, with a normalized Haar measure $\mu$. $G$ is assumed to be a compact
and unitary group. Define an invariant kernel $\mathcal{K}$ between $x, z \in \mathcal{X}$ through Haar-integration [11] as
follows:

$$\mathcal{K}(x, z) = \int_G \int_G k_0(gx, g'z)\, d\mu(g)\, d\mu(g'). \tag{1}$$

As we are integrating over the entire group, it is easy to see that $\mathcal{K}(g'x, gz) = \mathcal{K}(x, z)$, $\forall g, g' \in G$,
$\forall x, z \in \mathcal{X}$. Hence the Haar-integration kernel is invariant to the group action. The symmetry of
$\mathcal{K}$ is obvious. Moreover, if $k_0$ is a positive definite kernel, it follows that $\mathcal{K}$ is positive definite as
well [11]. One can see the Haar-integration kernel framework as another form of data augmentation,
since we have to produce group-transformed points in order to compute the kernel.
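
As a minimal sketch of Equation (1), assume a finite group, so that the Haar measure is uniform and the double integral becomes an average over all pairs of group elements. The choice of cyclic shifts as the group, the RBF bandwidth `gamma`, and all function names below are illustrative assumptions and do not come from the paper.

```python
import math

def rbf(x, z, gamma=1.0):
    """Base kernel k_0: a Gaussian RBF on plain Python lists."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))

def shift(x, r):
    """Cyclic shift by r positions: a unitary action of the finite group Z_d."""
    r %= len(x)
    return x[-r:] + x[:-r] if r else list(x)

def haar_kernel(x, z, k0=rbf):
    """Equation (1) for the cyclic group: average k0 over all pairs of shifts."""
    d = len(x)
    total = sum(k0(shift(x, r), shift(z, q)) for r in range(d) for q in range(d))
    return total / d ** 2

def normalize(v):
    """Project a vector onto the unit sphere."""
    n = math.sqrt(sum(a * a for a in v))
    return [a / n for a in v]

x = normalize([1.0, 2.0, 0.5, -1.0])
z = normalize([0.3, -0.7, 1.2, 0.1])
k_plain = haar_kernel(x, z)                      # K(x, z)
k_moved = haar_kernel(shift(x, 1), shift(z, 3))  # K(g x, g' z)
```

Under these assumptions `k_plain` and `k_moved` agree up to floating-point rounding, while the base kernel `rbf` alone changes when one argument is shifted. The cost is visible in the double loop: every kernel evaluation transforms both points by every group element.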


Invariant Decision Boundary. Turning now to a binary classification problem, we assume that we
are given a labeled training set: $S = \{(x_i, y_i) \mid x_i \in \mathcal{X}, y_i \in \mathcal{Y} = \{\pm 1\}\}_{i=1}^{N}$. In order to learn a
decision function $f : \mathcal{X} \to \mathcal{Y}$, we minimize the following empirical risk induced by an L-Lipschitz,
convex loss function $V$, with $V'(0) < 0$ [12]:

$$\min_{f \in \mathcal{H}_{\mathcal{K}}} \hat{\mathcal{E}}_V(f) := \frac{1}{N} \sum_{i=1}^{N} V(y_i f(x_i)),$$

where we restrict $f$ to belong to a hypothesis class induced by the invariant kernel $\mathcal{K}$, the so-called Reproducing
Kernel Hilbert Space $\mathcal{H}_{\mathcal{K}}$. The representer theorem [13] shows that the solution of such a problem,
or the optimal decision boundary $f_N^*$, has the following form: $f_N^*(x) = \sum_{i=1}^{N} \alpha_i^* \mathcal{K}(x, x_i)$. Since the
kernel $\mathcal{K}$ is group-invariant it follows that:

$$f_N^*(gx) = \sum_{i=1}^{N} \alpha_i^* \mathcal{K}(gx, x_i) = \sum_{i=1}^{N} \alpha_i^* \mathcal{K}(x, x_i) = f_N^*(x), \quad \forall g \in G.$$

Hence the decision boundary $f_N^*$ is group-invariant as well, and we have:
$f_N^*(gx) = f_N^*(x)$, $\forall g \in G$, $\forall x \in \mathcal{X}$.

Reduced Sample Complexity. We have shown that a group-invariant kernel induces a group-invariant
decision boundary, but how does this translate to the sample complexity of the learning
algorithm? To answer this question, we will assume that the input set $\mathcal{X}$ has the following structure:
$\mathcal{X} = \mathcal{X}_0 \cup G\mathcal{X}_0$, $G\mathcal{X}_0 = \{z \mid z = gx, x \in \mathcal{X}_0, g \in G \setminus \{e\}\}$, where $e$ is the identity group element.
This structure implies that for a function $f$ in the invariant RKHS $\mathcal{H}_{\mathcal{K}}$, we have:

$\forall z \in G\mathcal{X}_0$, $\exists x \in \mathcal{X}_0$, $\exists g \in G$ such that $z = gx$ and $f(z) = f(x)$.

Let $\rho_y(x) = \mathbb{P}(Y = y \mid x)$ be the label posteriors. We assume that $\rho_y(gx) = \rho_y(x)$, $\forall g \in G$. This
is a natural assumption since the label is unchanged given the group action. Assume that the set $\mathcal{X}$
is endowed with a measure $\rho_{\mathcal{X}}$ that is also group-invariant. Let $f$ be the group-invariant decision
function and consider the expected risk induced by the loss $V$, $\mathcal{E}_V(f)$, defined as follows:

$$\mathcal{E}_V(f) = \int_{\mathcal{X}} \sum_{y \in \mathcal{Y}} V(y f(x))\, \rho_y(x)\, \rho_{\mathcal{X}}(x)\, dx. \tag{2}$$

$\mathcal{E}_V(f)$ is a proxy to the misclassification risk [12]. Using the invariant properties of the function
class and the data distribution we have by invariance of $f$, $\rho_y$, and $\rho_{\mathcal{X}}$:


£v{f) 



Hence, given a kernel that is invariant to an identity-preserving group action, it is sufficient to
minimize the empirical risk on the core set $\mathcal{X}_0$, and the solution generalizes to samples in $G\mathcal{X}_0$.

Let us imagine that $\mathcal{X}$ is finite with cardinality $|\mathcal{X}|$; the cardinality of the core set $\mathcal{X}_0$ is a small
fraction of the cardinality of $\mathcal{X}$: $|\mathcal{X}_0| = \alpha|\mathcal{X}|$, where $0 < \alpha < 1$. Hence, when we sample training
points from $\mathcal{X}_0$, the maximum size of the training set is $N = \alpha|\mathcal{X}| \ll |\mathcal{X}|$, yielding a reduction in
the sample complexity.
1.2 Contributions


We have just reviewed the group-invariant Haar-integration kernel. In summary, a group-invariant
kernel implies the existence of a decision function that is invariant to the group action, as well as
a reduction in the sample complexity due to sampling training points from a reduced set, a.k.a. the
core set $\mathcal{X}_0$.

Kernel methods with Haar-integration kernels come at a very expensive computational price at both
training and test time: computing the kernel is computationally cumbersome as we have to integrate
over the group and produce virtual examples by transforming points explicitly through the group
action. Moreover, the training complexity of kernel methods scales cubically in the sample size.
These practical considerations make the usefulness of such kernels very limited.

The contributions of this paper are as follows:

1. We first show that a non-linear random feature map $\Phi : \mathcal{X} \to \mathbb{R}^D$ derived from a memory-based
theory of invariances introduced in [1] induces an expected group-invariant Haar-integration
kernel $K$. For fixed points $x, z \in \mathcal{X}$, we have: $\mathbb{E}\langle \Phi(x), \Phi(z) \rangle = K(x, z)$,
where $K$ satisfies: $K(gx, g'z) = K(x, z)$, $\forall g, g' \in G$, $x, z \in \mathcal{X}$.

2. We show a Johnson-Lindenstrauss type result that holds uniformly on a set of N points and that
assesses the concentration of this random feature map around its expected induced kernel. For
sufficiently large $D$, we have $\langle \Phi(x), \Phi(z) \rangle \approx K(x, z)$, uniformly on an N-point set.

3. We show that, with a linear model, an invariant decision function can be learned in this
random feature space by sampling points from the core set $\mathcal{X}_0$, i.e. $f_N^*(x) \approx \langle w^*, \Phi(x) \rangle$,
and generalizes to unseen points in $G\mathcal{X}_0$, reducing the sample complexity. Moreover, we
show that those features define a function space that approximates a dense subset of the
invariant RKHS, and assess the error rates of the empirical risk minimization using such
random features.

4. We demonstrate the validity of these claims on three datasets: text (artificial), vision
(MNIST), and speech (TIDIGITS).

2 From Group Invariant Kernels to Feature Maps 

In this paper we show that a random feature map based on I-theory [1], $\Phi : \mathcal{X} \to \mathbb{R}^D$, approximates
a group-invariant Haar-integration kernel $K$ having the form given in Equation (1):

$$\langle \Phi(x), \Phi(z) \rangle \approx K(x, z).$$

We start with some notation that will be useful for defining the feature map. Denote the cumulative
distribution function of a random variable $X$ by

$$F_X(\tau) = \mathbb{P}(X \le \tau).$$

Fix $x \in \mathcal{X}$. Let $g \in G$ be a random variable drawn according to the normalized Haar measure $\mu$ and
let $t$ be a random template whose distribution will be defined later. For $s > 0$, define the following
truncated cumulative distribution function (CDF) of the dot product $\langle x, gt \rangle$:

$$\psi(x, t, \tau) = \mathbb{P}_g(\langle x, gt \rangle \le \tau) = F_{\langle x, gt \rangle}(\tau), \quad \tau \in [-s, s],\ x \in \mathcal{X}.$$

Let $\epsilon \in (0, 1)$. We consider the following Gaussian vectors (sampling with rejection) for the
templates $t$:

$$t = n \sim \mathcal{N}\left(0, \tfrac{1}{d} I_d\right) \ \text{if } \|n\|_2^2 < 1 + \epsilon, \qquad t = \perp \ \text{else}.$$

The reason behind this sampling is to keep the range of $\langle x, gt \rangle$ under control: the squared norm
$\|n\|_2^2$ will be bounded by $1 + \epsilon$ with high probability by a classical concentration result (see the proof
of Theorem 1 for more details). The group being unitary and $x \in \mathbb{S}^{d-1}$, we know that
$|\langle x, gt \rangle| \le \|n\|_2 < \sqrt{1 + \epsilon} \le 1 + \epsilon$, for $\epsilon \in (0, 1)$.
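
The rejection step above can be sketched as follows. This is only an illustration under one stated simplification: a rejected draw is redrawn until `m` templates are accepted, whereas the text maps a rejected draw to $\perp$. The function names and the parameter values are hypothetical.

```python
import math
import random

def sample_template(d, eps, rng):
    """Draw n ~ N(0, I_d / d); return it if ||n||^2 < 1 + eps, else None."""
    n = [rng.gauss(0.0, 1.0 / math.sqrt(d)) for _ in range(d)]
    return n if sum(v * v for v in n) < 1.0 + eps else None

def sample_templates(m, d, eps, seed=0):
    """Collect m accepted templates (assumption: rejected draws are redrawn)."""
    rng = random.Random(seed)
    accepted = []
    while len(accepted) < m:
        t = sample_template(d, eps, rng)
        if t is not None:
            accepted.append(t)
    return accepted

templates = sample_templates(m=50, d=64, eps=0.5)
```

Every accepted template has squared norm below $1 + \epsilon$, so for a unit-norm $x$ and a unitary group the projections $\langle x, gt \rangle$ stay inside $[-(1+\epsilon), 1+\epsilon]$, which is the range the truncated CDF is defined on.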

Remark 1. We can also consider templates $t$ drawn uniformly on the unit sphere $\mathbb{S}^{d-1}$. Uniform
templates on the sphere can be drawn as follows:

$$t = \frac{\nu}{\|\nu\|_2}, \quad \nu \sim \mathcal{N}(0, I_d).$$

Since the norm of a Gaussian vector is highly concentrated around its mean $\sqrt{d}$, we can use the
Gaussian sampling with rejection. Results proved for Gaussian templates (with rejection) will hold
true for templates drawn uniformly on the sphere, with different constants.


Define the following kernel function,

$$K_s(x, z) = \mathbb{E}_t \int_{-s}^{s} \psi(x, t, \tau)\, \psi(z, t, \tau)\, d\tau,$$

where $s$ will be fixed throughout the paper to be $s = 1 + \epsilon$ since the Gaussian sampling with rejection
controls the dot product to be in that range.

Let $\bar{g} \in G$. As the group is closed, we have $\psi(\bar{g}x, t, \tau) = \int_G 1_{\langle g\bar{g}x, t \rangle \le \tau}\, d\mu(g) =
\int_G 1_{\langle gx, t \rangle \le \tau}\, d\mu(g) = \psi(x, t, \tau)$, and hence $K_s(gx, g'z) = K_s(x, z)$ for all $g, g' \in G$. It is clear
now that $K_s$ is a group-invariant kernel.

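In order to approximate $K_s$, we sample $|G|$ elements uniformly and independently from the group
$G$, i.e. $g_i$, $i = 1 \dots |G|$, and define the normalized empirical CDF:

$$\phi(x, t, \tau) = \frac{1}{|G|\sqrt{m}} \sum_{i=1}^{|G|} 1_{\langle g_i t, x \rangle \le \tau}, \quad -s \le \tau \le s.$$

We discretize the continuous threshold $\tau$ as follows:

$$\phi\left(x, t, \frac{sk}{n}\right) = \frac{\sqrt{s}}{\sqrt{nm}\,|G|} \sum_{i=1}^{|G|} 1_{\langle g_i t, x \rangle \le \frac{sk}{n}}, \quad -n \le k \le n.$$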


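We sample $m$ templates independently according to the Gaussian sampling with rejection, $t_j$, $j =
1 \dots m$. We are now ready to define the random feature map $\Phi$:

$$\Phi(x) = \left[\phi\left(x, t_j, \frac{sk}{n}\right)\right]_{j = 1 \dots m,\ k = -n \dots n} \in \mathbb{R}^{(2n+1) \times m}.$$

It is easy to see that:

$$\lim_{n \to \infty} \mathbb{E}_{t,g} \langle \Phi(x), \Phi(z) \rangle_{\mathbb{R}^{(2n+1) \times m}} = \lim_{n \to \infty} \mathbb{E}_{t,g} \sum_{j=1}^{m} \sum_{k=-n}^{n} \phi\left(x, t_j, \frac{sk}{n}\right) \phi\left(z, t_j, \frac{sk}{n}\right) = K_s(x, z).$$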

In Section 3 we study the geometric information captured by this kernel by stating explicitly the 

similarity it computes. 
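
The construction of this section can be sketched end to end. The example is a toy under explicit assumptions: the group is the finite group of cyclic shifts and is enumerated in full instead of being sampled, rejected template draws are redrawn, and the scale factor $\sqrt{s/(nm)}$ is one normalization that makes the sum over bins a Riemann sum of step $s/n$. All names and parameter values are hypothetical.

```python
import math
import random

def shift(v, r):
    """Cyclic shift by r positions (a unitary action of Z_d)."""
    r %= len(v)
    return v[-r:] + v[:-r] if r else list(v)

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def draw_templates(m, d, eps, rng):
    """Gaussian templates n ~ N(0, I_d / d), redrawn until ||n||^2 < 1 + eps."""
    out = []
    while len(out) < m:
        n = [rng.gauss(0.0, 1.0 / math.sqrt(d)) for _ in range(d)]
        if dot(n, n) < 1.0 + eps:
            out.append(n)
    return out

def feature_map(x, templates, n_bins, s):
    """Phi(x): for each template t_j and threshold s*k/n, the fraction of group
    elements g with <g t_j, x> <= s*k/n, scaled by sqrt(s / (n*m))."""
    d, m = len(x), len(templates)
    scale = math.sqrt(s / (n_bins * m))
    phi = []
    for t in templates:
        # Project x on every shifted copy of the template (the stored orbit of t).
        proj = [dot(shift(t, r), x) for r in range(d)]
        for k in range(-n_bins, n_bins + 1):
            tau = s * k / n_bins
            phi.append(scale * sum(p <= tau for p in proj) / d)
    return phi

rng = random.Random(0)
d, m, n_bins, eps = 6, 8, 10, 0.5
s = 1.0 + eps
templates = draw_templates(m, d, eps, rng)
x = [rng.gauss(0.0, 1.0) for _ in range(d)]
norm = math.sqrt(dot(x, x))
x = [v / norm for v in x]
phi_x = feature_map(x, templates, n_bins, s)
phi_gx = feature_map(shift(x, 2), templates, n_bins, s)
```

With the whole group enumerated, shifting `x` only permutes the list of projections, so `phi_x` and `phi_gx` coincide. Note that only the templates are transformed, never the input point.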

Remark 2 (Efficiency of the representation). 1) The main advantage of such a feature map, as
outlined in [1], is that we store transformed templates in order to compute $\Phi$, while if we wanted
to compute an invariant kernel of type $\mathcal{K}$ (Equation (1)), we would need to explicitly transform
the points. The latter is computationally expensive. Storing transformed templates and computing
the signature $\Phi$ is much more efficient. It falls in the category of memory-based learning, and is
biologically plausible [1].

2) As $|G|$, $m$, $n$ get large enough, the feature map $\Phi$ approximates a group-invariant kernel, as we
will see in the next section.


3 An Equivalent Expected Kernel and a Uniform Concentration Result 


In this section we present our main results, with proofs given in the supplementary material.
Theorem 1 shows that the random feature map $\Phi$, defined in the previous section, corresponds in
expectation to a group-invariant Haar-integration kernel $K_s(x, z)$. Moreover, $s - K_s(x, z)$ computes the
average pairwise distance between all points in the orbits of $x$ and $z$, where the orbit is defined as
the collection of all group-transformations of a given point $x$: $O_x = \{gx, g \in G\}$.

Theorem 1 (Expectation). Let $\epsilon \in (0, 1)$ and $x, z \in \mathcal{X}$. Define the distance $d_G$ between the orbits
$O_x$ and $O_z$:

$$d_G(x, z) = \frac{1}{\sqrt{2\pi d}} \int_G \int_G \|gx - g'z\|_2\, d\mu(g)\, d\mu(g'),$$

and the group-invariant expected kernel

$$K_s(x, z) = \lim_{n \to \infty} \mathbb{E}_{t,g} \langle \Phi(x), \Phi(z) \rangle_{\mathbb{R}^{(2n+1) \times m}} = \mathbb{E}_t \int_{-s}^{s} \psi(x, t, \tau)\, \psi(z, t, \tau)\, d\tau, \quad s = 1 + \epsilon. \tag{3}$$

1. The following inequality holds with probability 1:

$$\epsilon - \delta_2(d, \epsilon) \le K_s(x, z) - (1 - d_G(x, z)) \le \epsilon + \delta_1(d, \epsilon),$$

where $\delta_1(d, \epsilon) = \frac{e^{-d\epsilon^2/16}}{\sqrt{d}} - \frac{e^{-d\epsilon/2}(1+\epsilon)^{d/2}}{2\sqrt{d}}$ and $\delta_2(d, \epsilon) = \frac{e^{-d\epsilon^2/16}}{\sqrt{d}} + (1 + \epsilon)\, e^{-d\epsilon^2/8}$.

2. For any $\epsilon \in (0, 1)$, as the dimension $d \to \infty$ we have $\delta_1(d, \epsilon) \to 0$ and $\delta_2(d, \epsilon) \to 0$, and
we have asymptotically $K_s(x, z) \to 1 - d_G(x, z) + \epsilon = s - d_G(x, z)$.

3. $K_s$ is symmetric and positive semi-definite.

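Remark 3. 1) $\epsilon$, $\delta_1(d, \epsilon)$, and $\delta_2(d, \epsilon)$ are not errors due to results holding with high probability
but are due to the truncation and are a technical artifact of the proof. 2) Local invariance can be
defined by restricting the sampling of the group elements to a subset $\mathcal{G} \subset G$. Assuming that for each
$g \in \mathcal{G}$, $g^{-1} \in \mathcal{G}$, the equivalent kernel has asymptotically the following form:

$$K_s(x, z) \approx s - \frac{1}{\sqrt{2\pi d}} \int_{\mathcal{G}} \int_{\mathcal{G}} \|gx - g'z\|_2\, d\mu(g)\, d\mu(g').$$

3) The norm-one constraint can be relaxed. Let $R = \sup_{x \in \mathcal{X}} \|x\|_2 < \infty$; hence we can set $s =
R(1 + \epsilon)$, and

$$-\delta_2(d, \epsilon) \le K_s(x, z) - (R(1 + \epsilon) - d_G(x, z)) \le \delta_1(d, \epsilon), \tag{4}$$

where $\delta_1(d, \epsilon) = R\frac{e^{-d\epsilon^2/16}}{\sqrt{d}} - \frac{R}{2}\frac{e^{-d\epsilon/2}(1+\epsilon)^{d/2}}{\sqrt{d}}$ and $\delta_2(d, \epsilon) = R\frac{e^{-d\epsilon^2/16}}{\sqrt{d}} + R(1 + \epsilon)\, e^{-d\epsilon^2/8}$.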


Theorem 2 is, in a sense, an invariant Johnson-Lindenstrauss [14] type result where we show that
the dot product defined by the random feature map $\Phi$, i.e. $\langle \Phi(x), \Phi(z) \rangle$, is concentrated around the
invariant expected kernel uniformly on a data set of N points, given a sufficiently large number of
templates $m$, a large number of sampled group elements $|G|$, and a large bin number $n$. The error
naturally decomposes into a numerical error $\epsilon_0$ and statistical errors $\epsilon_1$, $\epsilon_2$ due to the sampling of the
templates and the group elements respectively.

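Theorem 2. [Johnson-Lindenstrauss type Theorem, N-point Set] Let $\mathcal{D} = \{x_i \mid x_i \in \mathcal{X}\}_{i=1}^{N}$
be a finite dataset. Fix $\epsilon_0, \epsilon_1, \epsilon_2, \delta_1, \delta_2 \in (0, 1)$. For a number of bins $n \ge \frac{1}{\epsilon_0}$, templates $m \ge
\frac{C_1}{\epsilon_1^2} \log\left(\frac{N}{\delta_1}\right)$, and group elements $|G| \ge \frac{C_2}{\epsilon_2^2} \log\left(\frac{Nm}{\delta_2}\right)$, where $C_1$, $C_2$ are universal numeric constants,
we have:

$$|\langle \Phi(x_i), \Phi(x_j) \rangle - K_s(x_i, x_j)| \le \epsilon_0 + \epsilon_1 + \epsilon_2, \quad i = 1 \dots N,\ j = 1 \dots N, \tag{5}$$

with probability $1 - \delta_1 - \delta_2$.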


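Putting together Theorems 1 and 2, the following Corollary shows how the group-invariant random
feature map $\Phi$ captures the invariant distance between points uniformly on a dataset of N points.

Corollary 1 (Invariant Feature Maps and Distances between Orbits). Let $\mathcal{D} = \{x_i \mid x_i \in \mathcal{X}\}_{i=1}^{N}$
be a finite dataset. Fix $\epsilon_0, \delta \in (0, 1)$. For a number of bins $n \ge \frac{3}{\epsilon_0}$, templates $m \ge \frac{9 C_1}{\epsilon_0^2} \log\left(\frac{N}{\delta}\right)$,
and group elements $|G| \ge \frac{9 C_2}{\epsilon_0^2} \log\left(\frac{Nm}{\delta}\right)$, where $C_1$, $C_2$ are universal numeric constants, we have:

$$\epsilon - \delta_2(d, \epsilon) - \epsilon_0 \le \langle \Phi(x_i), \Phi(x_j) \rangle - (1 - d_G(x_i, x_j)) \le \epsilon_0 + \epsilon + \delta_1(d, \epsilon), \tag{6}$$

$i = 1 \dots N$, $j = 1 \dots N$, with probability $1 - 2\delta$.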

Remark 4. Assuming that the templates are unitary and drawn from a general distribution $p(t)$, the
equivalent kernel has the following form:

$$K_s(x, z) = \int_G \int_G d\mu(g)\, d\mu(g') \left(s - \int \max(\langle x, gt \rangle, \langle z, g't \rangle)\, p(t)\, dt\right).$$

Indeed, when we use the Gaussian sampling with rejection for the templates, the integral
$\int \max(\langle x, gt \rangle, \langle z, g't \rangle)\, p(t)\, dt$ is asymptotically proportional to $\|g^{-1}x - g'^{-1}z\|_2$. It is interesting
to consider different distributions that are domain-specific for the templates and assess the number
of templates needed to approximate such kernels. It is also interesting to find the optimal
templates that achieve the minimum distortion in Equation (6), in a data-dependent way, but we will
address these points in future work.
4 Learning with Group Invariant Random Features


In this section, we show that learning a linear model in the invariant, random feature space, on a
training set sampled from the reduced core set $\mathcal{X}_0$, has a low expected risk, and generalizes to unseen
test points generated from the distribution on $\mathcal{X} = \mathcal{X}_0 \cup G\mathcal{X}_0$. The architecture of the proof follows
ideas from [15] and [16]. Recall that given an L-Lipschitz convex loss function $V$, our aim is to
minimize the expected risk given in Equation (2). Denote the CDF by $\psi(x, t, \tau) = \mathbb{P}(\langle gt, x \rangle \le \tau)$,
and the empirical CDF by $\hat{\psi}(x, t, \tau) = \frac{1}{|G|} \sum_{i=1}^{|G|} 1_{\langle g_i t, x \rangle \le \tau}$. Let $p(t)$ be the distribution of templates
$t$. The RKHS defined by the invariant kernel $K_s$, $K_s(x, z) = \int \int_{-s}^{s} \psi(x, t, \tau)\, \psi(z, t, \tau)\, p(t)\, dt\, d\tau$,
denoted $\mathcal{H}_{K_s}$, is the completion of the set of all finite linear combinations of the form:

$$f(x) = \sum_i \alpha_i K_s(x, x_i), \quad x_i \in \mathcal{X},\ \alpha_i \in \mathbb{R}. \tag{7}$$

i 

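Similarly to [16], we define the following infinite-dimensional function space:

$$\mathcal{F}_p = \left\{ f(x) = \int \int_{-s}^{s} w(t, \tau)\, \psi(x, t, \tau)\, dt\, d\tau \ \Big|\ \sup_{\tau, t} \frac{|w(t, \tau)|}{p(t)} \le C \right\}.$$

Lemma 1. $\mathcal{F}_p$ is dense in $\mathcal{H}_{K_s}$. For $f \in \mathcal{F}_p$ we have $\mathcal{E}_V(f) = \int_{\mathcal{X}_0} \sum_{y \in \mathcal{Y}} V(y f(x))\, \rho_y(x)\, \rho_{\mathcal{X}}(x)\, dx$,
where $\mathcal{X}_0$ is the reduced core set.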


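Since $\mathcal{F}_p$ is dense in $\mathcal{H}_{K_s}$, we can learn an invariant decision function in the space $\mathcal{F}_p$, instead
of learning in $\mathcal{H}_{K_s}$. Let $\Psi(x) = \left[\hat{\psi}\left(x, t_j, \frac{sk}{n}\right)\right]_{j = 1 \dots m,\ k = -n \dots n}$. $\Psi$ and $\Phi$ are equivalent up to
constants. We will approximate the set $\mathcal{F}_p$ as follows:

$$\hat{\mathcal{F}} = \left\{ f(x) = \langle w, \Psi(x) \rangle = \frac{s}{n} \sum_{j=1}^{m} \sum_{k=-n}^{n} w_{j,k}\, \hat{\psi}\left(x, t_j, \frac{sk}{n}\right) \ \Big|\ t_j \sim p,\ j = 1 \dots m,\ \|w\|_\infty \le \frac{C}{m} \right\}.$$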



Hence, we learn the invariant decision function via empirical risk minimization where we restrict
the function to belong to $\hat{\mathcal{F}}$, and the sampling in the training set is restricted to the core set $\mathcal{X}_0$. Note
that with this function space we are regularizing, for convenience, the infinity norm of the weights,
but this can be relaxed in practice to a classical Tikhonov regularization.

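Theorem 3 (Learning with Group invariant features). Let $S = \{(x_i, y_i) \mid x_i \in \mathcal{X}_0, y_i \in
\mathcal{Y}, i = 1 \dots N\}$, a training set sampled from the core set $\mathcal{X}_0$. Let $f_N^* = \arg\min_{f \in \hat{\mathcal{F}}} \hat{\mathcal{E}}_V(f) =
\frac{1}{N} \sum_{i=1}^{N} V(y_i f(x_i))$. Fix $\delta > 0$, then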




(^4:LsC + 2V{0)+LC. 



2sLC 





with probability at least $1 - 3\delta$ on the training set and the choice of templates and group elements.


The proof of Theorem 3 is given in Appendix B. Theorem 3 shows that learning a linear model
in the invariant random feature space defined by $\Phi$ (or equivalently $\Psi$) has a low expected
risk. More importantly, this risk is arbitrarily close to the optimal risk achieved in an
infinite-dimensional class of functions, namely $\mathcal{F}_p$. The training set is sampled from the reduced core
set $\mathcal{X}_0$, and invariant learning generalizes to unseen test points generated from the distribution
on $\mathcal{X} = \mathcal{X}_0 \cup G\mathcal{X}_0$, hence the reduction in the sample complexity. Recall that $\mathcal{F}_p$ is dense in
the RKHS of the Haar-integration invariant kernel, and so the expected risk achieved by a linear
model in the invariant random feature space is not far from the one attainable in the invariant
RKHS. Note that the error decomposes into two terms. The first, $O\left(\frac{1}{\sqrt{N}}\right)$, is statistical and it
depends on the training sample complexity $N$. The other is governed by the approximation error of
functions in $\mathcal{F}_p$ with functions in $\hat{\mathcal{F}}$, and depends on the number of templates $m$, the number of group
elements sampled $|G|$, and the number of bins $n$, and has the following form: $O\left(\frac{1}{\sqrt{m}}\right) + O\left(\frac{1}{\sqrt{|G|}}\right) + \frac{1}{n}$.
5 Relation to Previous Work


We now put our contributions in perspective by outlining some of the previous work on invariant 
kernels and approximating kernels with random features. 

Approximating Kernels. Several schemes have been proposed for approximating a non-linear kernel
with an explicit non-linear feature map in conjunction with linear methods, such as the Nyström
method [17] or random sampling techniques in the Fourier domain for translation-invariant kernels
[15]. Our features fall under the random sampling techniques where, unlike previous work, we sample
both projections and group elements to induce invariance with an integral representation. We
note that the relation between random features and quadrature rules has been thoroughly studied in
[18], where sharper bounds and error rates are derived, and can apply to our setting.

Invariant Kernels. We focused in this paper on Haar-integration kernels [11], since they have an
integral representation and hence can be represented with random features [18]. Other invariant
kernels have been proposed. In [19] the authors introduce transformation-invariant kernels, but unlike
our general setting, the analysis is concerned with dilation invariance. In [20], multilayer arccosine
kernels are built by composing kernels that have an integral representation, but they do not explicitly
induce invariance. More closely related to our work is [21], where kernel descriptors are built for visual
recognition by introducing a kernel view of histogram of gradients that corresponds in our case
to the cumulative distribution on the group variable. Explicit feature maps are obtained via kernel
PCA, while our features are obtained via random sampling. Finally, the convolutional kernel network
of [22] builds a sequence of multilayer kernels that have an integral representation, by convolution,
considering spatial neighborhoods in an image. Our future work will consider the composition of
Haar-integration kernels, where the convolution is applied not only to the spatial variable but to the
group variable, akin to [2].


6 Numerical Evaluation 

In this paper, and specifically in Theorems 2 and 3, we showed that the random, group-invariant
feature map $\Phi$ captures the invariant distance between points, and that learning a linear model
trained in the invariant, random feature space will generalize well to unseen test points. In this
section, we validate these claims through three experiments. For the claims of Theorem 2, we
will use a nearest neighbor classifier, while for Theorem 3, we will rely on the regularized least
squares (RLS) classifier, one of the simplest algorithms for supervised learning. While our proofs
focus on norm-infinity regularization, RLS corresponds to Tikhonov regularization with square
loss. Specifically, for performing T-way classification on a batch of N training points in $\mathbb{R}^d$,
summarized in the data matrix $X \in \mathbb{R}^{N \times d}$ and label matrix $Y \in \mathbb{R}^{N \times T}$, RLS will perform the
optimization $\min_W \left\{ \frac{1}{N} \|Y - \Phi(X)W\|_F^2 + \lambda \|W\|_F^2 \right\}$, where $\|\cdot\|_F$ is the Frobenius norm,
$\lambda$ is the regularization parameter, and $\Phi$ is the feature map, which for the representation described
in this paper will be a CDF pooling of the data projected onto group-transformed random templates.
All RLS experiments in this paper were completed with the GURLS toolbox [23]. The three
datasets we explore are:

Xperm (Figure 1): An artificial dataset consisting of all sequences of length 5 whose elements
come from an alphabet of 8 characters. We want to learn a function which assigns a positive value 
to any sequence that contains a target set of characters (in our case, two of them) regardless of their 
position. Thus, the function label is globally invariant to permutation, and so we project our data 
onto all permuted versions of our random template sequences. 

MNIST (Figure 2): We seek local invariance to translation and rotation, and so all random templates 
are translated by up to 3 pixels in all directions and rotated between -20 and 20 degrees. 

TIDIGITS (Figure 3): We use a subset of TIDIGITS consisting of 326 speakers (men, women, 
children) reading the digits 0-9 in isolation, and so each datapoint is a waveform of a single word. 
We seek local invariance to pitch and speaking rate [25], and so all random templates are pitch 
shifted up and down by 400 cents and warped to play at half and double speed. The task is 10-way 
classification with one class-per-digit. See [24] for more detail. 
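
The RLS optimization used in these experiments has a closed-form solution. The sketch below is not the GURLS implementation; it is a plain-Python stand-in that assumes the objective $\frac{1}{N}\|Y - \Phi W\|_F^2 + \lambda\|W\|_F^2$ and solves the resulting normal equations $(\Phi^\top \Phi + \lambda N I)\, W = \Phi^\top Y$ by Gauss-Jordan elimination. The tiny feature matrix, the labels, and the value of `lam` are made-up illustration data.

```python
def rls_fit(Phi, Y, lam):
    """Solve (Phi^T Phi + lam * N * I) W = Phi^T Y for W.

    Phi: N x D list of feature rows, Y: N x T one-hot label rows."""
    N, D, T = len(Phi), len(Phi[0]), len(Y[0])
    A = [[sum(Phi[i][a] * Phi[i][b] for i in range(N)) + (lam * N if a == b else 0.0)
          for b in range(D)] for a in range(D)]
    B = [[sum(Phi[i][a] * Y[i][c] for i in range(N)) for c in range(T)]
         for a in range(D)]
    M = [A[a] + B[a] for a in range(D)]  # augmented matrix [A | B]
    for col in range(D):
        # Partial pivoting, then normalize the pivot row and eliminate the column.
        piv = max(range(col, D), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        p = M[col][col]
        M[col] = [v / p for v in M[col]]
        for r in range(D):
            if r != col:
                f = M[r][col]
                M[r] = [vr - f * vc for vr, vc in zip(M[r], M[col])]
    return [row[D:] for row in M]

# Hypothetical 2-class toy problem with 4 points and 2 features.
Phi = [[1.0, 0.5], [0.2, 1.0], [0.9, 0.1], [0.1, 0.8]]
Y = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
lam = 0.1
W = rls_fit(Phi, Y, lam)
```

A test point with feature row `phi` would then be assigned to the class with the largest entry of `phi` times `W`. Training on features keeps the linear system at size D, the feature dimension, instead of N, the sample size.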


Acknowledgements: Stephen Voinea acknowledges the support of a Nuance Foundation Grant.
This work was also supported in part by the Center for Brains, Minds and Machines (CBMM), 
funded by NSF STC award CCF 1231216. 
[Figure 1 plots: classification accuracy versus number of training points per class (10 to 1000). Legend: Raw, Bag-Of-Words, Haar, CDF(25,1), CDF(25,10).]

Figure 1: Classification accuracy as a function of training set size, averaged over 100 random
training samples at each size. $\Phi$ = CDF(n, m) refers to a random feature map with n bins and m
templates. With 25 templates, the random feature map outperforms the raw features and a
bag-of-words representation (also invariant to permutation) and even approaches an RLS classifier with a
Haar-integration kernel. Error bars were removed from the RLS plot for clarity. See supplement.



[Figure 2 plots. Left: "MNIST Accuracy RLS (1000 Points Per Class)", accuracy versus number of templates (1 to 100) for 5 and 25 bins. Right: "MNIST Sample Complexity RLS", accuracy versus number of training points per class.]

Figure 2: Left Plot) Mean classification accuracy as a function of number of bins and templates,
averaged over 30 random sets of templates. Right Plot) Classification accuracy as a function of
training set size, averaged over 100 random samples of the training set at each size. At 1000 examples
per class, we achieve an accuracy of 98.97%.


[Figure 3 plots. Left: "TIDIGITS Speaker RLS". Right: "TIDIGITS Gender RLS". Both show accuracy versus number of templates (10 to 1000) for 5, 25, and 100 bins.]

Figure 3: Mean classification accuracy as a function of number of bins and templates, averaged
over 30 random sets of templates. In the “Speaker” dataset, we test on unseen speakers, and in the
“Gender” dataset, we test on a new gender, giving us an extreme train/test mismatch [25].


A Proofs of Theorems 1 and 2 


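Proof of Theorem 1. 1)

$$\begin{aligned}
K_s(x, z) &= \mathbb{E}_t \int_{-s}^{s} \left(\int_G 1_{\langle x, gt \rangle \le \tau}\, d\mu(g)\right) \left(\int_G 1_{\langle z, g't \rangle \le \tau}\, d\mu(g')\right) d\tau \\
&= \int_G \int_G d\mu(g)\, d\mu(g')\, \mathbb{E}_t \int_{-s}^{s} 1_{\langle x, gt \rangle \le \tau}\, 1_{\langle z, g't \rangle \le \tau}\, d\tau \\
&= \int_G \int_G d\mu(g)\, d\mu(g')\, \mathbb{E}_t \left(s - \max(\langle x, gt \rangle, \langle z, g't \rangle)\right),
\end{aligned}$$

where the second equality is by Fubini's theorem and the last one holds since for $a, b \in [-s, s]$:

$$\int_{-s}^{s} 1_{a \le \tau}\, 1_{b \le \tau}\, d\tau = s - \max(a, b).$$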
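
The integral identity used in the last step is easy to check numerically. The following sketch approximates the integral of the product of the two indicators over $[-s, s]$ with a midpoint Riemann sum; the function name, the step count, and the sample values of `a`, `b`, and `s` are arbitrary choices for illustration.

```python
def indicator_integral(a, b, s, steps=200000):
    """Midpoint Riemann sum of 1[a <= tau] * 1[b <= tau] for tau in [-s, s]."""
    h = 2.0 * s / steps
    # Both indicators equal 1 exactly when tau >= max(a, b).
    hits = sum(1 for i in range(steps) if max(a, b) <= -s + (i + 0.5) * h)
    return h * hits

a, b, s = 0.3, -0.8, 1.5
approx = indicator_integral(a, b, s)
exact = s - max(a, b)
```

The two indicators are simultaneously one only on $[\max(a, b), s]$, an interval of length $s - \max(a, b)$, so `approx` matches `exact` up to the grid resolution.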

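Recall that the sampling of $t$ is the following. For $\epsilon \in (0, 1)$ let:

$$t = n \sim \mathcal{N}\left(0, \tfrac{1}{d} I_d\right) \ \text{if } \|n\|_2^2 < 1 + \epsilon, \qquad t = \perp \ \text{else}.$$

Since our group is unitary, $x$ being norm one, and by virtue of this sampling, the dot product satisfies
$|\langle x, gt \rangle| \le \|n\|_2 < \sqrt{1 + \epsilon} \le 1 + \epsilon$. Hence $\langle x, gt \rangle \in [-(1 + \epsilon), 1 + \epsilon]$, and we can choose
$s = 1 + \epsilon$. Using again the fact that the group is unitary and compact we have:

$$K_s(x, z) = \int_G \int_G d\mu(g)\, d\mu(g')\, \mathbb{E}_t \left(s - \max\left(\langle g^{-1}x, t \rangle, \langle g'^{-1}z, t \rangle\right)\right).$$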

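Now using this particular sampling of templates we have:

$$K_s(x, z) = \int_G \int_G d\mu(g)\, d\mu(g')\, \mathbb{E}_n \left(1_{\|n\|_2^2 < 1 + \epsilon} \left[1 + \epsilon - \max\left(\langle g^{-1}x, n \rangle, \langle g'^{-1}z, n \rangle\right)\right]\right).$$

Let

$$Z_{x,z}(n, g, g') = \max\left(\langle g^{-1}x, n \rangle, \langle g'^{-1}z, n \rangle\right).$$

It follows that:

$$\begin{aligned}
K_s(x, z) &= \int_G \int_G d\mu(g)\, d\mu(g')\, \mathbb{E}_n \left(1_{\|n\|_2^2 < 1 + \epsilon} \left[1 + \epsilon - Z_{x,z}(n, g, g')\right]\right) \\
&= (1 + \epsilon)\, \mathbb{P}(\|n\|_2^2 < 1 + \epsilon) - \int_G \int_G d\mu(g)\, d\mu(g')\, \mathbb{E}_n \left(1_{\|n\|_2^2 < 1 + \epsilon}\, Z_{x,z}(n, g, g')\right) \\
&= (1 + \epsilon)\, \mathbb{P}(\|n\|_2^2 < 1 + \epsilon) - \int_G \int_G d\mu(g)\, d\mu(g')\, \mathbb{E}_n \left(\left(1 - 1_{\|n\|_2^2 \ge 1 + \epsilon}\right) Z_{x,z}(n, g, g')\right) \\
&= (1 + \epsilon)\, \mathbb{P}(\|n\|_2^2 < 1 + \epsilon) - \int_G \int_G d\mu(g)\, d\mu(g')\, \mathbb{E}_n Z_{x,z}(n, g, g') \\
&\quad + \int_G \int_G d\mu(g)\, d\mu(g')\, \mathbb{E}_n \left(1_{\|n\|_2^2 \ge 1 + \epsilon}\, Z_{x,z}(n, g, g')\right).
\end{aligned} \tag{8}$$

We are left with evaluating or bounding two expectations: $I_1 = \mathbb{E}_n Z_{x,z}(n, g, g')$, and $I_2 =
\mathbb{E}_n \left(1_{\|n\|_2^2 \ge 1 + \epsilon}\, Z_{x,z}(n, g, g')\right)$, which involve the maximum of correlated Gaussian variables, as we
will see in the following.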

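By rotation invariance of Gaussians we have that $\langle g^{-1}x, n \rangle$ and $\langle g'^{-1}z, n \rangle$ are two correlated
Gaussian random variables with correlation coefficient that we denote by $\cos(\theta_{g,g'}) = \langle g^{-1}x, g'^{-1}z \rangle$.
Hence by a change of basis we can write:

$$\langle g^{-1}x, n \rangle = \frac{1}{\sqrt{d}}\, u, \qquad \langle g'^{-1}z, n \rangle = \frac{1}{\sqrt{d}} \cos(\theta_{g,g'})\, u + \frac{1}{\sqrt{d}} \sqrt{1 - \cos^2(\theta_{g,g'})}\, v,$$

where $\cos(\theta_{g,g'}) = \langle g^{-1}x, g'^{-1}z \rangle$, and $u, v \sim \mathcal{N}(0, 1)$ i.i.d.
Hence,

$$I_1 = \frac{1}{\sqrt{d}}\, \mathbb{E} \max\left(u,\ \cos(\theta_{g,g'})\, u + \sqrt{1 - \cos^2(\theta_{g,g'})}\, v\right).$$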


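The following Lemma from [26] gives the expectation and the variance of the maximum of two
Gaussians with correlation coefficient $\rho$.

Lemma 2 (Mean and Variance of Maximum of Correlated Gaussians [26]). Let $X \sim \mathcal{N}(\mu_X, \sigma_X^2)$
and $Y \sim \mathcal{N}(\mu_Y, \sigma_Y^2)$ be correlated Gaussians with correlation coefficient $\rho$. Define $\phi_{\mathcal{N}}(x) =
\frac{1}{\sqrt{2\pi}} \exp(-x^2/2)$, and $\Phi_{\mathcal{N}}(x) = \int_{-\infty}^{x} \phi_{\mathcal{N}}(u)\, du$. Let $a = \sqrt{\sigma_X^2 + \sigma_Y^2 - 2\rho\, \sigma_X \sigma_Y}$, and $\alpha =
\frac{\mu_X - \mu_Y}{a}$.

The mean $\mu_Z$ and variance $\sigma_Z^2$ of $Z = \max(X, Y)$ are expressed analytically as follows:

$$\mu_Z = \mu_X \Phi_{\mathcal{N}}(\alpha) + \mu_Y \Phi_{\mathcal{N}}(-\alpha) + a\, \phi_{\mathcal{N}}(\alpha). \tag{9}$$

$$\sigma_Z^2 = \underbrace{(\sigma_X^2 + \mu_X^2)\, \Phi_{\mathcal{N}}(\alpha) + (\sigma_Y^2 + \mu_Y^2)\, \Phi_{\mathcal{N}}(-\alpha) + (\mu_X + \mu_Y)\, a\, \phi_{\mathcal{N}}(\alpha)}_{\mathbb{E} Z^2} - \mu_Z^2. \tag{10}$$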


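Applying Lemma 2 to our case ($\mu_X = \mu_Y = 0$, $\sigma_X = \sigma_Y = 1$, $\rho = \cos(\theta_{g,g'})$), we have
$a = \sqrt{2(1 - \cos(\theta_{g,g'}))}$ and $\alpha = 0$, so that

$$I_1 = \frac{1}{\sqrt{2\pi d}} \sqrt{2(1 - \cos(\theta_{g,g'}))} = \frac{1}{\sqrt{2\pi d}} \left\|g^{-1}x - g'^{-1}z\right\|_2. \tag{11}$$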


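We turn now to $I_2$, which we bound using the Cauchy-Schwarz inequality:

$$|I_2| = \left|\mathbb{E}_n \left(1_{\|n\|_2^2 \ge 1 + \epsilon}\, Z_{x,z}(n, g, g')\right)\right| \le \sqrt{\mathbb{E}\left(1^2_{\|n\|_2^2 \ge 1 + \epsilon}\right)} \sqrt{\mathbb{E}\left(Z^2_{x,z}(n, g, g')\right)} = \sqrt{\mathbb{P}\left(\|n\|_2^2 \ge 1 + \epsilon\right)} \sqrt{\mathbb{E}\left(Z^2_{x,z}(n, g, g')\right)}. \tag{12}$$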


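On the one hand, applying again Lemma 2 (for $\mathbb{E} Z^2$) we have:

$$\mathbb{E}\left(Z^2_{x,z}(n, g, g')\right) = \frac{1}{d}\, \mathbb{E}\left(\max\left(u,\ \cos(\theta_{g,g'})\, u + \sqrt{1 - \cos^2(\theta_{g,g'})}\, v\right)\right)^2 = \frac{1}{d}\left(2\, \Phi_{\mathcal{N}}(0)\right) = \frac{1}{d}. \tag{13}$$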
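
The two moments obtained from Lemma 2 can be sanity-checked by simulation. The sketch below draws pairs of standard Gaussians with correlation `rho` and compares the empirical mean and second moment of their maximum with $\sqrt{2(1 - \rho)}/\sqrt{2\pi}$ and 1, i.e. the values above before the $1/\sqrt{d}$ and $1/d$ scaling. The correlation value, the sample size, and the seed are arbitrary assumptions.

```python
import math
import random

def max_moments(rho, trials=200000, seed=0):
    """Monte Carlo estimate of E[Z] and E[Z^2] for Z = max(u, rho*u + sqrt(1-rho^2)*v)."""
    rng = random.Random(seed)
    c = math.sqrt(1.0 - rho * rho)
    s1 = s2 = 0.0
    for _ in range(trials):
        u = rng.gauss(0.0, 1.0)
        v = rng.gauss(0.0, 1.0)
        z = max(u, rho * u + c * v)
        s1 += z
        s2 += z * z
    return s1 / trials, s2 / trials

rho = 0.3
mean_mc, second_mc = max_moments(rho)
mean_formula = math.sqrt(2.0 * (1.0 - rho)) / math.sqrt(2.0 * math.pi)
```

Since $\|g^{-1}x - g'^{-1}z\|_2^2 = 2(1 - \cos\theta_{g,g'})$ for unit vectors, `mean_formula` is exactly the distance term of Equation (11) up to the factor $1/\sqrt{d}$.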


On the other hand, note that $\|n\|_2^2$ has a (normalized) chi-squared distribution with $d$ degrees of
freedom, $\frac{1}{d}\chi_d^2$, with mean 1. The following Lemma gives upper bounds for the upper and lower tails
of a chi-squared distribution.

Lemma 3 ($\chi^2$ tail bounds). Let $X \sim \chi_k^2$ be a chi-squared random variable with $k$ degrees of freedom.
The following hold true for any $\epsilon \in (0, 1)$:

• Upper bound for the upper tail [27]: $\mathbb{P}\left(\frac{1}{k} X \ge 1 + \epsilon\right) \le e^{-k\epsilon^2/8}$.

• Upper bound for the lower tail [28]: For all $k \ge 2$, $u \ge k - 1$ we have:

$$\mathbb{P}(X \le u) \le 1 - \frac{1}{2} \exp\left(-\frac{1}{2}\left(u - k - (k - 2)\log(u/k) + \log(k)\right)\right).$$

More specifically, for $u = k(1 + \epsilon)$ we have:

$$\mathbb{P}\left(\frac{X}{k} \le 1 + \epsilon\right) \le 1 - \frac{1}{2} \frac{e^{-k\epsilon/2}(1 + \epsilon)^{(k-2)/2}}{\sqrt{k}}.$$


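Applying Lemma 3 to $\|n\|_2^2$: we have $\|n\|_2^2 = \frac{X}{d}$ where $X \sim \chi_d^2$, hence:

$$\mathbb{P}\left(\|n\|_2^2 \ge 1 + \epsilon\right) \le e^{-d\epsilon^2/8}. \tag{14}$$

Putting together Equations (12), (14), and (13) we have finally:

$$|I_2| \le \frac{e^{-d\epsilon^2/16}}{\sqrt{d}}. \tag{15}$$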
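
A quick simulation illustrates how loose or tight the tail bound behind this estimate is. The sketch draws Gaussian vectors with covariance $I_d/d$, counts how often the squared norm reaches $1 + \epsilon$, and compares that frequency with $e^{-d\epsilon^2/8}$; it also evaluates the resulting bound on $|I_2|$. The dimension, $\epsilon$, trial count, and seed are illustrative assumptions.

```python
import math
import random

def upper_tail_frequency(d, eps, trials=20000, seed=0):
    """Empirical frequency of ||n||^2 >= 1 + eps for n ~ N(0, I_d / d)."""
    rng = random.Random(seed)
    sd = 1.0 / math.sqrt(d)
    hits = 0
    for _ in range(trials):
        sq_norm = sum(rng.gauss(0.0, sd) ** 2 for _ in range(d))
        hits += sq_norm >= 1.0 + eps
    return hits / trials

d, eps = 32, 0.5
freq = upper_tail_frequency(d, eps)
tail_bound = math.exp(-d * eps ** 2 / 8.0)     # right-hand side of Equation (14)
i2_bound = math.sqrt(tail_bound / d)           # right-hand side of Equation (15)
```

For these values the empirical frequency sits well below `tail_bound`, which is consistent with the remark that the $\delta$ terms are technical artifacts of the truncation and vanish as $d$ grows.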


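Putting together Equations (8), (11), and (15), and using upper and lower bounds for $\mathbb{P}(\|n\|_2^2 < 1 + \epsilon)$
from Lemma 3:

$$\begin{aligned}
K_s(x, z) &\le (1 + \epsilon)\, \mathbb{P}(\|n\|_2^2 < 1 + \epsilon) - \frac{1}{\sqrt{2\pi d}} \int_G \int_G \left\|g^{-1}x - g'^{-1}z\right\|_2 d\mu(g)\, d\mu(g') + \frac{e^{-d\epsilon^2/16}}{\sqrt{d}} \\
&\le (1 + \epsilon)\left(1 - \frac{1}{2} \frac{e^{-d\epsilon/2}(1 + \epsilon)^{(d-2)/2}}{\sqrt{d}}\right) - \frac{1}{\sqrt{2\pi d}} \int_G \int_G \left\|g^{-1}x - g'^{-1}z\right\|_2 d\mu(g)\, d\mu(g') + \frac{e^{-d\epsilon^2/16}}{\sqrt{d}},
\end{aligned}$$

$$\begin{aligned}
K_s(x, z) &\ge (1 + \epsilon)\, \mathbb{P}(\|n\|_2^2 < 1 + \epsilon) - \frac{1}{\sqrt{2\pi d}} \int_G \int_G \left\|g^{-1}x - g'^{-1}z\right\|_2 d\mu(g)\, d\mu(g') - \frac{e^{-d\epsilon^2/16}}{\sqrt{d}} \\
&\ge (1 + \epsilon)\left(1 - e^{-d\epsilon^2/8}\right) - \frac{1}{\sqrt{2\pi d}} \int_G \int_G \left\|g^{-1}x - g'^{-1}z\right\|_2 d\mu(g)\, d\mu(g') - \frac{e^{-d\epsilon^2/16}}{\sqrt{d}}.
\end{aligned}$$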

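Denoting the integral by $d_G$ and using that the group is compact and unitary:

$$d_G(x, z) = \frac{1}{\sqrt{2\pi d}} \int_G \int_G \left\|g^{-1}x - g'^{-1}z\right\|_2 d\mu(g)\, d\mu(g') = \frac{1}{\sqrt{2\pi d}} \int_G \int_G \|gx - g'z\|_2\, d\mu(g)\, d\mu(g').$$

We finally have:

$$-\frac{e^{-d\epsilon^2/16}}{\sqrt{d}} - (1 + \epsilon)\, e^{-d\epsilon^2/8} + \epsilon \le K_s(x, z) - (1 - d_G(x, z)) \le \frac{e^{-d\epsilon^2/16}}{\sqrt{d}} - \frac{1}{2} \frac{e^{-d\epsilon/2}(1 + \epsilon)^{d/2}}{\sqrt{d}} + \epsilon.$$

For any $\epsilon \in (0, 1)$, as the dimension $d \to \infty$, we have asymptotically:

$$K_s(x, z) \to 1 - d_G(x, z) + \epsilon = s - d_G(x, z). \tag{16}$$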


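2) The symmetry of $K_s$ is obvious. Let $p(t)$ be the distribution of the templates $t$. Define the
following weighted dot product: $\langle f(x, \cdot, \cdot), g(z, \cdot, \cdot) \rangle = \int p(t)\, dt \int_{-s}^{s} d\tau\, f(x, t, \tau)\, g(z, t, \tau)$. Recall
that:

$$K_s(x, z) = \int p(t)\, dt \int_{-s}^{s} \psi(x, t, \tau)\, \psi(z, t, \tau)\, d\tau = \langle \psi(x, \cdot, \cdot), \psi(z, \cdot, \cdot) \rangle.$$

Hence $K_s$ is symmetric and positive semidefinite.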

□ 
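Proof of Theorem 2. In the following we fix two points $x$ and $z$ in $\mathcal{X}$ and a random template $t$. Let
$X_j = \int_{-s}^{s} \mathbb{P}(\langle gt_j, x \rangle \le \tau)\, \mathbb{P}(\langle gt_j, z \rangle \le \tau)\, d\tau$; we have $0 \le X_j \le 2s$, where $s = 1 + \epsilon$. Recall that
$K_s(x, z) = \mathbb{E}_t(X_j)$. By Hoeffding's inequality we have:

$$\mathbb{P}_t\left\{ \left| \frac{1}{m} \sum_{j=1}^{m} X_j - K_s(x, z) \right| \ge \epsilon \right\} \le 2 \exp\left(\frac{-2m\epsilon^2}{(2s)^2}\right).$$

Turning now to the CDF $\psi(x, t, \tau) = \mathbb{P}(\langle gt, x \rangle \le \tau)$, and the empirical CDF $\hat{\psi}(x, t, \tau) =
\frac{1}{|G|} \sum_{i=1}^{|G|} 1_{\langle g_i t, x \rangle \le \tau}$. By the theorem on convergence of the empirical CDF [29] (Theorem 4 given
in Appendix D) we have, for $\gamma > 0$:

$$\mathbb{P}_g\left\{ \sup_{\tau} \left| \hat{\psi}(x, t, \tau) - \psi(x, t, \tau) \right| > \gamma \right\} \le 2 \exp(-2|G|\gamma^2).$$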


Hence we have Vr G [—s, sj: 

1p{x,t,T) - 'f{x,t,T) 


< 7 and 


ll>{x,t,T) - f{z,t,T) 


< 7 


with a probability at least 1 — 4exp(— 2 |G| 7 ^). 
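The uniform CDF deviation bound used above can be illustrated numerically. This is a minimal sketch under simplifying assumptions: the projections are replaced by i.i.d. uniform samples (true CDF $F(\tau) = \tau$ on $[0,1]$), and `dkw_check` and its arguments are hypothetical names introduced for the example.

```python
import numpy as np

def dkw_check(num_group_samples, gamma, n_trials=2000, seed=0):
    """Estimate P(sup_tau |F_hat(tau) - F(tau)| > gamma) and compare with 2 exp(-2 |G| gamma^2)."""
    rng = np.random.default_rng(seed)
    n = num_group_samples
    u = np.sort(rng.random((n_trials, n)), axis=1)   # sorted samples, one trial per row
    k = np.arange(1, n + 1)
    # For sorted uniform samples the sup deviation is attained at a sample point.
    d_plus = np.max(k / n - u, axis=1)
    d_minus = np.max(u - (k - 1) / n, axis=1)
    sup_dev = np.maximum(d_plus, d_minus)
    freq = float(np.mean(sup_dev > gamma))
    bound = min(1.0, float(2.0 * np.exp(-2.0 * n * gamma**2)))
    return freq, bound

freq, bound = dkw_check(num_group_samples=200, gamma=0.1)
print(f"empirical frequency={freq:.4f}  bound={bound:.4f}")
```

Because the bound is nearly tight, the empirical frequency is expected to be close to, but not far above, the bound up to Monte Carlo error.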

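Define $X = \int_{-s}^{s}\psi(x,t,\tau)\,\psi(z,t,\tau)\,d\tau$, $\hat X = \int_{-s}^{s}\hat\psi(x,t,\tau)\,\hat\psi(z,t,\tau)\,d\tau$, and $\tilde X = \frac{s}{n}\sum_{k=-n}^{n}\hat\psi(x,t,\frac{sk}{n})\,\hat\psi(z,t,\frac{sk}{n})$. Choose $0 < \gamma < 1$:

$$|X - \hat X| = \left|\int_{-s}^{s}\left(\psi(x,t,\tau)\psi(z,t,\tau) - \hat\psi(x,t,\tau)\hat\psi(z,t,\tau)\right)d\tau\right|$$

$$\le \int_{-s}^{s}\left|\left(\psi(x,t,\tau) - \hat\psi(x,t,\tau) + \hat\psi(x,t,\tau)\right)\left(\psi(z,t,\tau) - \hat\psi(z,t,\tau) + \hat\psi(z,t,\tau)\right) - \hat\psi(x,t,\tau)\hat\psi(z,t,\tau)\right|d\tau$$

$$\le (2\gamma + \gamma^2)\,2s \le 6s\gamma,$$

with probability $1 - 4\exp(-2|G|\gamma^2)$. Define $X_j = \int_{-s}^{s}\psi(x,t_j,\tau)\,\psi(z,t_j,\tau)\,d\tau$, $\hat X_j = \int_{-s}^{s}\hat\psi(x,t_j,\tau)\,\hat\psi(z,t_j,\tau)\,d\tau$, and $\tilde X_j = \frac{s}{n}\sum_{k=-n}^{n}\hat\psi(x,t_j,\frac{sk}{n})\,\hat\psi(z,t_j,\frac{sk}{n})$. Then for all $j = 1 \dots m$, we have

$$|X_j - \hat X_j| \le 6s\gamma$$

with probability $1 - 4m\exp(-2|G|\gamma^2) - 2\exp\left(-\frac{2m\varepsilon^2}{(2s)^2}\right)$.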

Now we turn to the numerical approximation of the integral by a Riemann sum; we have for all
$j = 1 \dots m$:


Xj - X, 

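Hence the error decomposes in the following way:

$$\left|\langle\Phi(x),\Phi(z)\rangle - K_s(x,z)\right| = \left|\frac{1}{m}\sum_{j=1}^{m}\tilde X_j - K_s(x,z)\right|$$

$$= \left|\left(\frac{1}{m}\sum_{j=1}^{m}\tilde X_j - \frac{1}{m}\sum_{j=1}^{m}\hat X_j\right) + \left(\frac{1}{m}\sum_{j=1}^{m}\hat X_j - \frac{1}{m}\sum_{j=1}^{m}X_j\right) + \left(\frac{1}{m}\sum_{j=1}^{m}X_j - K_s(x,z)\right)\right|$$

$$\le \underbrace{\left|\frac{1}{m}\sum_{j=1}^{m}\tilde X_j - \frac{1}{m}\sum_{j=1}^{m}\hat X_j\right|}_{\text{Numerical Binning Error}} + \underbrace{\left|\frac{1}{m}\sum_{j=1}^{m}\hat X_j - \frac{1}{m}\sum_{j=1}^{m}X_j\right|}_{\text{Group CDF Approximation Error}} + \underbrace{\left|\frac{1}{m}\sum_{j=1}^{m}X_j - K_s(x,z)\right|}_{\text{Templates Concentration Error}}$$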


< — + 6 s 7 + e. 
n 


with probability $1 - 4m\exp(-2|G|\gamma^2) - 2\exp\left(-\frac{2m\varepsilon^2}{(2s)^2}\right)$. For this to hold on all pairs of points in a set of cardinality $N$ we have:

|($(a:i),4>(xj)) - K{xi,Xj)\ < ^ + 657 + e,i = 1... N, j = 1... N, 

with probability $1 - 4mN(N-1)\exp(-2|G|\gamma^2) - 2N(N-1)\exp\left(-\frac{2m\varepsilon^2}{(2s)^2}\right)$.

Hence we have for numerical constants $C_1$ and $C_2$, $0 < \delta_1, \delta_2 < 1$, and $0 < \varepsilon_0, \varepsilon_1, \varepsilon_2 < 1$, for
n>^,m>^log(f),|G|>f log(i^),: 

$|\langle\Phi(x_i),\Phi(x_j)\rangle - K_s(x_i,x_j)| \le \varepsilon_0 + \varepsilon_1 + \varepsilon_2$, $i = 1 \dots N$, $j = 1 \dots N$,
with probability $1 - \delta_1 - \delta_2$.

□ 
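The three approximation steps in this proof (templates, sampled group elements, bins) correspond to a concrete feature map. The following is a minimal sketch, assuming the group of cyclic shifts acting on $\mathbb{R}^d$ (a finite unitary group that can be enumerated exactly), Gaussian templates with variance $1/d$, and a scaling of $\sqrt{s/(nm)}$ chosen so that the inner product averages over templates and bins. The names `cyclic_orbit` and `invariant_features` are hypothetical.

```python
import numpy as np

def cyclic_orbit(t):
    """Stack every cyclic shift of template t (its full orbit under the assumed group)."""
    return np.stack([np.roll(t, k) for k in range(t.shape[0])])

def invariant_features(x, templates, n_bins, s):
    """Empirical CDFs of <g t_j, x>, sampled at thresholds s*k/n for k = -n..n."""
    thresholds = s * np.arange(-n_bins, n_bins + 1) / n_bins
    feats = []
    for t in templates:
        proj = cyclic_orbit(t) @ x                      # <g t, x> for every g in G
        feats.append((proj[:, None] <= thresholds).mean(axis=0))
    # Assumed scaling, so that <Phi(x), Phi(z)> = (1/m) sum_j (s/n) sum_k psi_hat * psi_hat.
    return np.sqrt(s / (n_bins * len(templates))) * np.concatenate(feats)

rng = np.random.default_rng(0)
d, m, n_bins, s = 16, 30, 20, 1.5
templates = rng.normal(scale=1.0 / np.sqrt(d), size=(m, d))
x = rng.normal(size=d)
x /= np.linalg.norm(x)
z = rng.normal(size=d)
z /= np.linalg.norm(z)

phi_x = invariant_features(x, templates, n_bins, s)
phi_gx = invariant_features(np.roll(x, 5), templates, n_bins, s)
phi_z = invariant_features(z, templates, n_bins, s)
print("max |Phi(x) - Phi(gx)| =", np.abs(phi_x - phi_gx).max())
print("<Phi(x), Phi(z)> =", float(phi_x @ phi_z))
```

Since the whole orbit is enumerated, shifting $x$ only permutes the projections, so the pooled CDF features of $x$ and of its shifted copy coincide.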


B Proof of Theorem 3 


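Proof of Lemma 1. Our proof parallels similar proofs in [16]. Note that functions of the form (7) are dense in $\mathcal{H}_K$:

$$f(x) = \sum_i \alpha_i K_s(x, x_i) = \int\int_{-s}^{s} \psi(x,t,\tau)\sum_i \alpha_i\,\psi(x_i,t,\tau)\,p(t)\,dt\,d\tau.$$

Let $\beta(t,\tau) = p(t)\sum_i \alpha_i\,\psi(x_i,t,\tau)$. Since $0 \le \psi(x,t,\tau) \le 1$, $\forall x,t,\tau$, we have $\frac{|\beta(t,\tau)|}{p(t)} \le \sum_i |\alpha_i| < \infty$, since the $\alpha_i$ are finite. Hence $f$ can be written in the form:

$$f(x) = \int\int_{-s}^{s} \beta(t,\tau)\,\psi(x,t,\tau)\,dt\,d\tau, \qquad \sup_{t,\tau}\frac{|\beta(t,\tau)|}{p(t)} < \infty,$$

and $f \in \mathcal{F}_p$.

□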


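In order to prove Theorem 3, we need some preliminary lemmas. The following Lemma assesses the approximation of any function $f \in \mathcal{F}_p$ by a certain $\tilde f \in \tilde{\mathcal{F}}$.

Lemma 4 ($\tilde{\mathcal{F}}$ Approximation of $\mathcal{F}_p$). Let $f$ be a function in $\mathcal{F}_p$. Then for $\delta_1, \delta_2 > 0$, there exists a function $\tilde f \in \tilde{\mathcal{F}}$ such that:

$$\left\|\tilde f - f\right\|_{L^2(\mathcal{X},\rho_{\mathcal{X}})} \le \frac{2sC}{\sqrt{m}}\left(1 + \sqrt{2\log\left(\frac{1}{\delta_1}\right)}\right) + \frac{2sC}{\sqrt{|G|}}\left(1 + \sqrt{2\log\left(\frac{m}{\delta_2}\right)}\right) + \frac{2sC}{n},$$

with probability at least $1 - \delta_1 - \delta_2$.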


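Proof of Lemma 4. Let $f \in \mathcal{F}_p$, $f(x) = \int\int_{-s}^{s} w(t,\tau)\,\psi(x,t,\tau)\,d\tau\,dt$.

Let $f_j(x) = \int_{-s}^{s}\frac{w(t_j,\tau)}{p(t_j)}\,\psi(x,t_j,\tau)\,d\tau$, $\hat f_j(x) = \int_{-s}^{s}\frac{w(t_j,\tau)}{p(t_j)}\,\hat\psi(x,t_j,\tau)\,d\tau$, and $\tilde f_j(x) = \frac{s}{n}\sum_{k=-n}^{n}\frac{w(t_j,\frac{sk}{n})}{p(t_j)}\,\hat\psi(x,t_j,\frac{sk}{n})$. We have the following: $\mathbb{E}_t(f_j) = f$, and $\frac{1}{m}\mathbb{E}_t\left(\sum_{j=1}^{m} f_j\right) = f$.

Consider the Hilbert space $L^2(\mathcal{X},\rho_{\mathcal{X}})$, with dot product $\langle f, g\rangle_{L^2(\mathcal{X},\rho_{\mathcal{X}})} = \int_{\mathcal{X}} f(x)\,g(x)\,d\rho_{\mathcal{X}}(x)$.

Note that $\int_{-s}^{s} g(\tau)\,d\tau \le \sqrt{2s}\,\sqrt{\int_{-s}^{s} g^2(\tau)\,d\tau}$, so that

$$\|f_j\|_{L^2(\mathcal{X},\rho_{\mathcal{X}})} = \sqrt{\int_{\mathcal{X}}\left(\int_{-s}^{s}\frac{w(t_j,\tau)}{p(t_j)}\,\psi(x,t_j,\tau)\,d\tau\right)^2 d\rho_{\mathcal{X}}(x)} \le 2sC.$$

Fix $\delta_1 > 0$; applying Lemma 7 we have therefore with probability $1 - \delta_1$:

$$\left\|\frac{1}{m}\sum_{j=1}^{m} f_j - f\right\|_{L^2(\mathcal{X},\rho_{\mathcal{X}})} \le \frac{2sC}{\sqrt{m}}\left(1 + \sqrt{2\log\left(\frac{1}{\delta_1}\right)}\right). \qquad (17)$$

Now turn to:

$$\left\|\frac{1}{m}\sum_{j=1}^{m} f_j - \frac{1}{m}\sum_{j=1}^{m}\hat f_j\right\|_{L^2(\mathcal{X},\rho_{\mathcal{X}})} \le \frac{1}{m}\sum_{j=1}^{m}\left\|f_j - \hat f_j\right\|_{L^2(\mathcal{X},\rho_{\mathcal{X}})},$$

$$\left\|f_j - \hat f_j\right\|^2_{L^2(\mathcal{X},\rho_{\mathcal{X}})} = \int_{\mathcal{X}}\left(\int_{-s}^{s}\frac{w(t_j,\tau)}{p(t_j)}\left(\psi(x,t_j,\tau) - \hat\psi(x,t_j,\tau)\right)d\tau\right)^2 d\rho_{\mathcal{X}}(x)$$

$$\le 2s\int_{\mathcal{X}}\int_{-s}^{s}\frac{w^2(t_j,\tau)}{p^2(t_j)}\left(\psi(x,t_j,\tau) - \hat\psi(x,t_j,\tau)\right)^2 d\tau\,d\rho_{\mathcal{X}}(x)$$

$$\le 2sC^2\int_{\mathcal{X}}\int_{-s}^{s}\left(\psi(x,t_j,\tau) - \hat\psi(x,t_j,\tau)\right)^2 d\tau\,d\rho_{\mathcal{X}}(x)$$

$$= 2sC^2\int_{-s}^{s}\int_{\mathcal{X}}\left(\psi(x,t_j,\tau) - \hat\psi(x,t_j,\tau)\right)^2 d\rho_{\mathcal{X}}(x)\,d\tau$$

$$= 2sC^2\int_{-s}^{s}\left\|\psi(\cdot,t_j,\tau) - \hat\psi(\cdot,t_j,\tau)\right\|^2_{L^2(\mathcal{X},\rho_{\mathcal{X}})}d\tau$$

$$\le (2sC)^2\sup_{\tau}\left\|\psi(\cdot,t_j,\tau) - \hat\psi(\cdot,t_j,\tau)\right\|^2_{L^2(\mathcal{X},\rho_{\mathcal{X}})}.$$

Recall that $\hat\psi(x,t,\tau) = \frac{1}{|G|}\sum_{i=1}^{|G|}\mathbf{1}_{\langle g_i t, x\rangle \le \tau}$, and $\psi(x,t,\tau) = \mathbb{E}_g\,\hat\psi(x,t,\tau)$.

Clearly $\left\|\mathbf{1}_{\langle\cdot, gt\rangle \le \tau}\right\|_{L^2(\mathcal{X},\rho_{\mathcal{X}})} \le 1$; applying again Lemma 7, for $\delta_2 > 0$ we have with probability $1 - \delta_2$:

$$\left\|\psi(\cdot,t_j,\tau) - \hat\psi(\cdot,t_j,\tau)\right\|_{L^2(\mathcal{X},\rho_{\mathcal{X}})} \le \frac{1}{\sqrt{|G|}}\left(1 + \sqrt{2\log\left(\frac{1}{\delta_2}\right)}\right).$$

It follows that, $\forall j = 1 \dots m$,

$$\left\|f_j - \hat f_j\right\|_{L^2(\mathcal{X},\rho_{\mathcal{X}})} \le \frac{2Cs}{\sqrt{|G|}}\left(1 + \sqrt{2\log\left(\frac{1}{\delta_2}\right)}\right), \quad \text{with probability } 1 - m\delta_2.$$

Hence with probability $1 - m\delta_2$, we have:

$$\left\|\frac{1}{m}\sum_{j=1}^{m} f_j - \frac{1}{m}\sum_{j=1}^{m}\hat f_j\right\|_{L^2(\mathcal{X},\rho_{\mathcal{X}})} \le \frac{2Cs}{\sqrt{|G|}}\left(1 + \sqrt{2\log\left(\frac{1}{\delta_2}\right)}\right), \qquad (18)$$

and by the approximation of a Riemann sum we have that:

$$\left\|\frac{1}{m}\sum_{j=1}^{m}\hat f_j - \frac{1}{m}\sum_{j=1}^{m}\tilde f_j\right\|_{L^2(\mathcal{X},\rho_{\mathcal{X}})} \le \frac{2sC}{n}. \qquad (19)$$

It is clear that $\tilde f = \frac{1}{m}\sum_{j=1}^{m}\tilde f_j \in \tilde{\mathcal{F}}$; hence, putting together Equations (17), (18), and (19) we finally have:

$$\left\|\frac{1}{m}\sum_{j=1}^{m}\tilde f_j - f\right\|_{L^2(\mathcal{X},\rho_{\mathcal{X}})} \le \left\|\frac{1}{m}\sum_{j=1}^{m} f_j - f\right\| + \left\|\frac{1}{m}\sum_{j=1}^{m} f_j - \frac{1}{m}\sum_{j=1}^{m}\hat f_j\right\| + \left\|\frac{1}{m}\sum_{j=1}^{m}\hat f_j - \frac{1}{m}\sum_{j=1}^{m}\tilde f_j\right\|$$

$$\le \frac{2sC}{\sqrt{m}}\left(1 + \sqrt{2\log\left(\frac{1}{\delta_1}\right)}\right) + \frac{2Cs}{\sqrt{|G|}}\left(1 + \sqrt{2\log\left(\frac{1}{\delta_2}\right)}\right) + \frac{2sC}{n},$$

with probability $1 - \delta_1 - m\delta_2$.

□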


The following Lemma shows how the approximation of functions in $\mathcal{F}_p$ by functions in $\tilde{\mathcal{F}}$ translates to the expected risk:
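Lemma 5 (Bound on the Approximation Error). Let $f \in \mathcal{F}_p$, fix $\delta_1, \delta_2 > 0$. There exists a function $\tilde f \in \tilde{\mathcal{F}}$, such that:

$$\mathcal{E}_V(\tilde f) \le \mathcal{E}_V(f) + \frac{2sLC}{\sqrt{m}}\left(1 + \sqrt{2\log\left(\frac{1}{\delta_1}\right)}\right) + \frac{2sLC}{\sqrt{|G|}}\left(1 + \sqrt{2\log\left(\frac{m}{\delta_2}\right)}\right) + \frac{2sLC}{n},$$

with probability at least $1 - \delta_1 - \delta_2$.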



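Proof of Lemma 5. $\mathcal{E}_V(\tilde f) - \mathcal{E}_V(f) \le \int_{\mathcal{X}}\left|V(y\tilde f(x)) - V(yf(x))\right|d\rho_{\mathcal{X}}(x) \le L\int_{\mathcal{X}}|\tilde f(x) - f(x)|\,d\rho_{\mathcal{X}}(x) \le L\sqrt{\int_{\mathcal{X}}(\tilde f(x) - f(x))^2\,d\rho_{\mathcal{X}}(x)} = L\left\|\tilde f - f\right\|_{L^2(\mathcal{X},\rho_{\mathcal{X}})}$, where we used the Lipschitz condition and Jensen's inequality. The rest of the proof follows from Lemma 4.
□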


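The following Lemma gives a bound on the estimation of the expected risk with finite training samples:

Lemma 6 (Bound on the Estimation Error). Fix $\delta > 0$, then

$$\sup_{f \in \tilde{\mathcal{F}}}\left|\mathcal{E}_V(f) - \hat{\mathcal{E}}_V(f)\right| \le \frac{1}{\sqrt{N}}\left(4LsC + 2V(0) + LC\sqrt{\frac{1}{2}\log\left(\frac{1}{\delta}\right)}\right)$$

with probability $1 - \delta$.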

Proof. The proof follows from Theorem 5 given in Appendix C. It is sufficient to bound the
Rademacher complexity of the class $\tilde{\mathcal{F}}$:




sup 

./GJ^ 


1 ^ 


2=1 




sup 


s 


N 


i—1 \i=l k——n 


= E, 


< 


sup 

/G.F 

sC 

' mNn 


s 

'n^ 


m n 


N 


j—1 k——n i—1 


m n 

EE 

j—1k=—n 


N 


'^CTitp ( Xi,tj, — 


By Holder inequality: {a,b) < ||«|lc 


^ m n 

< E^ V y . 

mNn ^ ^ \ 

j—1k——n y 


N 


sk 


Ect j (JiY (xi,tj, — j j Jensen inequality, concavity of square root 


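Note that $\mathbb{E}(\sigma_i\sigma_j) = 0$ for $i \ne j$; it follows that:

$$\mathbb{E}_\sigma\left(\sum_{i=1}^{N}\sigma_i\,\hat\psi\left(x_i,t_j,\tfrac{sk}{n}\right)\right)^2 = \sum_{i=1}^{N}\hat\psi^2\left(x_i,t_j,\tfrac{sk}{n}\right) \le N, \quad \text{since } \hat\psi(\cdot,\cdot,\cdot) \le 1.$$

Finally:

$$\mathcal{R}_N(\tilde{\mathcal{F}}) \le \frac{Cs}{\sqrt{N}}.$$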


□ 
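The Jensen step used in this proof, $\mathbb{E}_\sigma\left|\sum_i \sigma_i a_i\right| \le \sqrt{\sum_i a_i^2} \le \sqrt{N}$ for $|a_i| \le 1$, can be checked by Monte Carlo. The sketch below is only an illustration of that inequality; the helper name `rademacher_abs_sum` and the choice of the vector `a` (standing in for CDF values in $[0,1]$) are assumptions made for the example.

```python
import numpy as np

def rademacher_abs_sum(a, n_draws=20_000, seed=0):
    """Monte Carlo estimate of E_sigma | sum_i sigma_i a_i | for i.i.d. Rademacher signs."""
    rng = np.random.default_rng(seed)
    sigma = rng.choice([-1.0, 1.0], size=(n_draws, a.shape[0]))
    return float(np.mean(np.abs(sigma @ a)))

N = 50
a = np.random.default_rng(1).random(N)     # values in [0, 1], like empirical CDF values
estimate = rademacher_abs_sum(a)
jensen_bound = float(np.sqrt(np.sum(a**2)))
print(f"E|sum sigma_i a_i| ~ {estimate:.3f}  <=  sqrt(sum a_i^2) = {jensen_bound:.3f}  <=  sqrt(N) = {np.sqrt(N):.3f}")
```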


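We are now ready to prove Theorem 3:

Proof of Theorem 3. Let $\tilde f_N = \arg\min_{f \in \tilde{\mathcal{F}}}\hat{\mathcal{E}}_V(f)$, $\tilde f = \arg\min_{f \in \tilde{\mathcal{F}}}\mathcal{E}_V(f)$, $f_p = \arg\min_{f \in \mathcal{F}_p}\mathcal{E}_V(f)$.

$$\mathcal{E}_V(\tilde f_N) - \min_{f \in \mathcal{F}_p}\mathcal{E}_V(f) = \underbrace{\left(\mathcal{E}_V(\tilde f_N) - \mathcal{E}_V(\tilde f)\right)}_{\text{Statistical Error}} + \underbrace{\left(\mathcal{E}_V(\tilde f) - \mathcal{E}_V(f_p)\right)}_{\text{Approximation Error}}$$

The first term is the usual estimation or statistical error, which we can bound using Lemma 6. We have:

$$\mathcal{E}_V(\tilde f_N) - \mathcal{E}_V(\tilde f) = \left(\mathcal{E}_V(\tilde f_N) - \hat{\mathcal{E}}_V(\tilde f_N)\right) + \underbrace{\left(\hat{\mathcal{E}}_V(\tilde f_N) - \hat{\mathcal{E}}_V(\tilde f)\right)}_{\le 0,\ \text{by optimality of } \tilde f_N} + \left(\hat{\mathcal{E}}_V(\tilde f) - \mathcal{E}_V(\tilde f)\right)$$

$$\le 2\sup_{f \in \tilde{\mathcal{F}}}\left|\mathcal{E}_V(f) - \hat{\mathcal{E}}_V(f)\right| \le \frac{2}{\sqrt{N}}\left(4LsC + 2V(0) + LC\sqrt{\frac{1}{2}\log\left(\frac{1}{\delta}\right)}\right),$$

with probability $1 - \delta$ over the training samples. Let $\tilde f_p$ be the function defined in Lemma 4 that approximates $f_p$ in $\tilde{\mathcal{F}}$. By Lemma 5 we know that:

$$\mathcal{E}_V(\tilde f_p) \le \mathcal{E}_V(f_p) + \frac{2sLC}{\sqrt{m}}\left(1 + \sqrt{2\log\left(\frac{1}{\delta_1}\right)}\right) + \frac{2sLC}{\sqrt{|G|}}\left(1 + \sqrt{2\log\left(\frac{m}{\delta_2}\right)}\right) + \frac{2sLC}{n},$$

with probability $1 - \delta_1 - \delta_2$, on the choice of the templates and the sampled group elements. By optimality of $\tilde f \in \tilde{\mathcal{F}}$, we have

$$\mathcal{E}_V(\tilde f) \le \mathcal{E}_V(\tilde f_p) \le \mathcal{E}_V(f_p) + \frac{2sLC}{\sqrt{m}}\left(1 + \sqrt{2\log\left(\frac{1}{\delta_1}\right)}\right) + \frac{2sLC}{\sqrt{|G|}}\left(1 + \sqrt{2\log\left(\frac{m}{\delta_2}\right)}\right) + \frac{2sLC}{n}.$$

Hence by a union bound, with probability $1 - \delta - \delta_1 - \delta_2$ on the training set, the templates, and the group elements, we have:

$$\mathcal{E}_V(\tilde f_N) - \min_{f \in \mathcal{F}_p}\mathcal{E}_V(f) \le \frac{2}{\sqrt{N}}\left(4LsC + 2V(0) + LC\sqrt{\frac{1}{2}\log\left(\frac{1}{\delta}\right)}\right) + \frac{2sLC}{\sqrt{m}}\left(1 + \sqrt{2\log\left(\frac{1}{\delta_1}\right)}\right) + \frac{2sLC}{\sqrt{|G|}}\left(1 + \sqrt{2\log\left(\frac{m}{\delta_2}\right)}\right) + \frac{2sLC}{n}.$$

□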


C Technical tools 


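Theorem 4 ([29]). Let $X_1, X_2, \dots, X_m$ be i.i.d. random variables with cumulative distribution function $F$, and let $\hat F_m$ be the associated empirical cumulative distribution function, $\hat F_m(\tau) = \frac{1}{m}\sum_{i=1}^{m}\mathbf{1}_{X_i \le \tau}$. Then for any $\gamma > 0$:

$$\mathbb{P}\left\{\sup_{\tau}\left|\hat F_m(\tau) - F(\tau)\right| > \gamma\right\} \le 2\exp(-2m\gamma^2).$$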

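Lemma 7 ([15], Concentration of the mean of bounded random variables in a Hilbert space). Let $(\mathcal{H}, \langle\cdot,\cdot\rangle_{\mathcal{H}})$ be a Hilbert space. Let $X_j$, $j = 1 \dots K$, be i.i.d. random variables such that $\|X_j\|_{\mathcal{H}} \le M$. Then for any $\delta > 0$, with probability $1 - \delta$,

$$\left\|\frac{1}{K}\sum_{j=1}^{K} X_j - \mathbb{E}\left(\frac{1}{K}\sum_{j=1}^{K} X_j\right)\right\|_{\mathcal{H}} \le \frac{M}{\sqrt{K}}\left(1 + \sqrt{2\log\left(\frac{1}{\delta}\right)}\right).$$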
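Lemma 7 can be illustrated in a finite-dimensional Hilbert space. The sketch below assumes the bound has the form $\frac{M}{\sqrt{K}}\left(1 + \sqrt{2\log(1/\delta)}\right)$ and uses vectors drawn uniformly from the sphere of radius $M$ in $\mathbb{R}^p$ (so the mean is zero); `hilbert_mean_deviation` and `lemma7_bound` are hypothetical helper names.

```python
import numpy as np

def hilbert_mean_deviation(K, M, p=20, n_trials=500, seed=0):
    """Norm of the empirical mean of K i.i.d. vectors with ||X_j|| = M and E[X_j] = 0."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(n_trials, K, p))
    x *= M / np.linalg.norm(x, axis=2, keepdims=True)   # project onto the sphere of radius M
    return np.linalg.norm(x.mean(axis=1), axis=1)       # one deviation per trial

def lemma7_bound(K, M, delta):
    """Assumed form of the Lemma 7 bound."""
    return M / np.sqrt(K) * (1.0 + np.sqrt(2.0 * np.log(1.0 / delta)))

K, M, delta = 100, 2.0, 0.05
deviations = hilbert_mean_deviation(K, M)
violations = float(np.mean(deviations > lemma7_bound(K, M, delta)))
print(f"fraction of trials above the bound: {violations:.4f} (should not exceed delta={delta})")
```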




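Theorem 5 ([15]). Let $\mathcal{F}$ be a bounded class of functions, $\sup_{x \in \mathcal{X}}|f(x)| \le C$ for all $f \in \mathcal{F}$. Let $V$ be an $L$-Lipschitz loss. Then with probability $1 - \delta$, with respect to training samples $\{x_i, y_i\}_{i=1 \dots N}$, every $f$ satisfies:

$$\mathcal{E}_V(f) \le \hat{\mathcal{E}}_V(f) + 4L\,\mathcal{R}_N(\mathcal{F}) + \frac{2V(0)}{\sqrt{N}} + LC\sqrt{\frac{1}{2N}\log\frac{1}{\delta}},$$

where $\mathcal{R}_N(\mathcal{F})$ is the Rademacher complexity of the class $\mathcal{F}$:

$$\mathcal{R}_N(\mathcal{F}) = \mathbb{E}_{x,\sigma}\left[\sup_{f \in \mathcal{F}}\left|\frac{1}{N}\sum_{i=1}^{N}\sigma_i f(x_i)\right|\right],$$

the variables $\sigma_i$ are i.i.d. symmetric Bernoulli random variables taking values in $\{-1, 1\}$ with equal probability, and are independent from $x_i$.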



D Numerical Evaluation 

D.1 Permutation Invariance Experiment

For our first experiment, we created an artificial dataset which was designed to exploit permutation
invariance, providing us with a finite group to which we had complete access. The dataset $X_{perm}$
consists of all sequences of length $L = 5$, where each element of the sequence is taken from an
alphabet $A$ of 8 characters, giving us a total of 32,768 data points. Two characters $c_1, c_2 \in A$ were
randomly chosen and designated as targets, so that a sequence $x \in X_{perm}$ is labeled positive if
it contains both $c_1$ and $c_2$, where the position of these characters in the sequence does not matter.
Likewise, any sequence that does not contain both characters is labeled negative. This provides us
with a binary classification problem (positive sequences vs. negative sequences), for which the label
is preserved by permutations of the sequence indices, i.e. two sequences will belong to the same
orbit if and only if they are permuted versions of one another.
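The dataset construction described above is small enough to enumerate directly. The following sketch builds all $8^5$ sequences and labels them; characters are represented as integers 0 to 7, and the function names and the particular target pair are illustrative assumptions (the paper chooses the two targets at random).

```python
import itertools
import numpy as np

ALPHABET_SIZE = 8
SEQ_LEN = 5

def label_sequences(seqs, c1, c2):
    """A sequence is positive iff it contains both target characters, in any position."""
    return np.any(seqs == c1, axis=1) & np.any(seqs == c2, axis=1)

def build_xperm(c1, c2):
    """Enumerate every length-5 sequence over an 8-character alphabet, with labels."""
    seqs = np.array(list(itertools.product(range(ALPHABET_SIZE), repeat=SEQ_LEN)))
    return seqs, label_sequences(seqs, c1, c2)

seqs, labels = build_xperm(c1=0, c2=3)      # hypothetical choice of target characters
print("number of sequences:", len(seqs))
print("number of positives:", int(labels.sum()))

# Labels are preserved when the sequence indices are permuted.
perm = [4, 2, 0, 1, 3]
print("labels invariant under permutation:",
      bool(np.array_equal(labels, label_sequences(seqs[:, perm], 0, 3))))
```

By inclusion-exclusion, $8^5 - 2\cdot 7^5 + 6^5 = 6{,}930$ of the sequences are positive for any pair of distinct targets.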

The $i$-th character in $A$ is encoded as an 8-dimensional vector which is 0 in every position but the $i$-th,
where it is 1. Each sequence $x \in X_{perm}$ is formed by concatenating the 5 such vectors representing
its characters, resulting in a binary vector of length 40. To build the permutation-invariant representation,
we project a binary sequence onto an equal-length sequence consisting of standard-normal
Gaussian vectors, as well as all of its permutations, and then pool over the projections with a CDF.
As a baseline, we also used a bag-of-words representation, where each $x \in X_{perm}$ was encoded
with an 8-dimensional vector whose $i$-th element is equal to the count of how many times character $i$ appears
in $x$. Note that this representation is also invariant to permutations, and so should share many
of the benefits of our feature map.
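The two encodings described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the thresholds are an evenly spaced grid on $[-\tau_{max}, \tau_{max}]$ with a hypothetical parameter `tau_max`, permutations act on the 5 blocks of the template, and the helper names are invented for the example.

```python
import itertools
import numpy as np

ALPHABET_SIZE = 8
SEQ_LEN = 5
# All 120 permutations of the sequence positions, one per row.
PERMS = np.array(list(itertools.permutations(range(SEQ_LEN))))

def one_hot(seq):
    """Concatenate five 8-dimensional indicator vectors into a binary vector of length 40."""
    x = np.zeros((SEQ_LEN, ALPHABET_SIZE))
    x[np.arange(SEQ_LEN), seq] = 1.0
    return x.ravel()

def bag_of_words(seq):
    """Baseline: count of each character, invariant to permutations by construction."""
    return np.bincount(seq, minlength=ALPHABET_SIZE)

def cdf_pooled_features(seq, templates, n_bins, tau_max):
    """Project onto every permuted template, then pool the projections with an empirical CDF."""
    x = one_hot(seq).reshape(SEQ_LEN, ALPHABET_SIZE)
    thresholds = np.linspace(-tau_max, tau_max, n_bins)   # assumed threshold grid
    feats = []
    for t in templates:                                   # t has shape (SEQ_LEN, ALPHABET_SIZE)
        proj = (t[PERMS] * x).sum(axis=(1, 2))            # <g t, x> for all 120 permutations g
        feats.append((proj[:, None] <= thresholds).mean(axis=0))
    return np.concatenate(feats)

rng = np.random.default_rng(0)
templates = rng.standard_normal((10, SEQ_LEN, ALPHABET_SIZE))
seq = np.array([0, 3, 3, 7, 1])
shuffled = seq[[4, 2, 0, 1, 3]]
f_seq = cdf_pooled_features(seq, templates, n_bins=25, tau_max=6.0)
f_shuffled = cdf_pooled_features(shuffled, templates, n_bins=25, tau_max=6.0)
print("CDF features invariant:", bool(np.allclose(f_seq, f_shuffled)))
print("bag-of-words invariant:", bool(np.array_equal(bag_of_words(seq), bag_of_words(shuffled))))
```

Pooling over all 120 permutations makes the CDF features exactly invariant; sampling only a subset of the permutations would give approximate invariance.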

For all classification results, 4000 points were randomly chosen from $X_{perm}$ to form the training set,
with an even split of 2000 positive points and 2000 negative points. The remaining 28,768 points
formed the test set.

We know from Theorem 3 that the expected risk is dependent on the number of templates used to
encode our data and on the number of bins used in the CDF-pooling step. The right panel of Figure
4 shows RLS classification accuracy on $X_{perm}$ for different numbers of templates and bins. We see
that, for a fixed number of templates, increasing the number of bins will improve accuracy, and for a
fixed number of bins, adding more templates will improve accuracy. We also know there is a further
dependence on the number of transformation samples from the group $G$. The left panel of Figure
4 shows how classification accuracy, for a fixed number of training points, bins, and templates,
depends on the number of transformations we have access to. We see the curve is rather flat, and there
is a very graceful degradation in performance.

In Figure 5, we include the sample complexity plot (for RLS) with the error bars added. 



Figure 4: Left) Classification accuracy of random invariant features as a function of the number of
sampled group elements on $X_{perm}$. Right) Classification accuracy of random invariant features as a
function of the number of templates and bin sizes on $X_{perm}$.
[Figure 5 plot. Legend: Raw, Bag-of-Words, CDF(25,10), CDF(25,25).]

Figure 5: Classification accuracy as a function of training set size. $\Phi = CDF(n, m)$ refers to a random
feature map with $n$ bins and $m$ templates. For each training set size, the accuracy is averaged
over 100 random training samples. With enough templates/bins, the random feature map outperforms
the raw features as well as a bag-of-words representation (also invariant to permutation). We
also train an RLS classifier with a Haar-invariant kernel, which naturally gives the best performance.
However, by increasing the number of templates, we come close to matching this performance with
random feature maps.


D.2 TIDIGITS Experiment 

Here, we add plots (Figures 6,7 and 8) showing performance as a function of number of templates 
and bins for some other splits of the TIDIGITS data. 


[Figure 6 plot: TIDIGITS Utterance RLS. x-axis: Number of Templates.]


Figure 6: Mean classification accuracy as a function of number of templates, $m$, and bins, $n$. Accuracy
is averaged over 30 random template samples for each $m$ and error bars are displayed. In
the “Utterance” dataset, we train and test on the same speakers, but the test set contains new utterances
of each digit. This is the easiest dataset, representing only intraspeaker variability, and the
performance is quite good even for a small number of bins.
Figure 7: Mean classification accuracy as a function of number of templates, $m$, and bins, $n$. Accuracy
is averaged over 30 random template samples for each $m$ and error bars are displayed. In the
“Age (Women)” dataset, we train on adult women and test on children, giving us an age mismatch.
Despite this mismatch, performance remains strong.
Figure 8: Mean classification accuracy as a function of number of templates, $m$, and bins, $n$. Accuracy
is averaged over 30 random template samples for each $m$ and error bars are displayed. In
the “Age (Men)” dataset, we train on adult men and test on children, giving us an age mismatch.
We see the weakest performance in this dataset, much worse than on the “Age (Women)” dataset.
This is possibly due to the fact that women have higher pitched voices than men, creating less of a
mismatch between women and children than men and children.